Red Wine Quality Data Exploration

The wine quality dataset was created in 2009 by Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV), using red wine samples. The inputs are objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (excellent).

Univariate Plots Section

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Many of the variables look approximately normally distributed. Chlorides, sulphates, alcohol, free sulfur dioxide and total sulfur dioxide are right-skewed and look closer to lognormal. Let’s drop the values above the 95th percentile for these five features and re-plot their histograms:
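One way to do this kind of trimming is to keep only the values at or below the 95th percentile before plotting. The sketch below uses base R on simulated lognormal data. The vector `chlorides_sim` and the helper `trim_top` are hypothetical stand-ins for illustration, not the code behind this report.

```r
# Simulated stand-in for a long-tailed feature such as chlorides.
# Assumption: in the real analysis this would be a column of the wine data frame.
set.seed(42)
chlorides_sim <- rlnorm(1599, meanlog = log(0.08), sdlog = 0.4)

# Hypothetical helper: keep only values at or below the p-th quantile.
trim_top <- function(x, p = 0.95) {
  x[x <= quantile(x, probs = p, na.rm = TRUE)]
}

trimmed <- trim_top(chlorides_sim)

# Histogram of the trimmed values; the long right tail is cut off.
hist(trimmed, breaks = 30,
     main = "Chlorides (values above 95th percentile removed)",
     xlab = "chlorides")
```

Trimming only changes what the histogram shows; the removed rows would still be available in the original vector for later analysis.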

Univariate Analysis

What is the structure of your dataset?

Number of red wine instances: 1599

Number of Attributes: 1 Serial Number + 11 Attributes + 1 Output Attribute

11 Attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Residual sugar, fixed acidity, pH, density and alcohol content may help support the investigation into the quality.

Did you create any new variables from existing variables in the dataset?

No, I didn’t.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Chlorides, total sulfur dioxide, free sulfur dioxide, sulphates and alcohol all appeared long-tailed. They were log-transformed, which revealed an approximately normal distribution for each.
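To illustrate why a log transform helps with long-tailed features, the sketch below compares sample skewness before and after `log10()` on simulated lognormal data. The vector `tsd_sim` and the `skewness` helper are assumptions made for this example, not part of the original analysis.

```r
# Simulated stand-in for a long-tailed feature such as total sulfur dioxide.
# Assumption: the real values would come from the wine data frame.
set.seed(1)
tsd_sim <- rlnorm(1599, meanlog = log(38), sdlog = 0.7)

# Hypothetical helper: sample skewness as the mean cubed z-score.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(tsd_sim)         # clearly positive: long right tail
skew_log <- skewness(log10(tsd_sim))  # close to zero: roughly symmetric

cat("skewness before:", round(skew_raw, 2),
    "after log10:", round(skew_log, 2), "\n")
```

A skewness near zero after the transform is consistent with a roughly symmetric, bell-shaped histogram on the log scale.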

Bivariate Plots Section

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

For quality, our main feature, the positive correlation coefficients greater than 0.1 are:

 alcohol:quality = 0.48
 sulphates:quality = 0.25
 citric.acid:quality = 0.23
 fixed.acidity:quality = 0.12

So alcohol content has the strongest correlation with red wine quality, although at 0.48 it is only moderate. Other attributes positively correlated with red wine quality are sulphates, citric acid and fixed acidity.

For quality, the negative correlation coefficients less than -0.1 are:

 volatile.acidity:quality = -0.39
 total.sulfur.dioxide:quality = -0.19
 density:quality = -0.17
 chlorides:quality = -0.13

So volatile acidity is negatively correlated with red wine quality, which agrees with the attribute description: at too high levels, acetic acid can lead to an unpleasant, vinegar taste. Total sulfur dioxide, density and chlorides are also negatively correlated with quality.
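Lists like the two above can be obtained by taking the `quality` column of a Pearson correlation matrix and filtering on the ±0.1 thresholds. The sketch below does this with `cor()` on a small simulated data frame. The data frame `toy` and the coefficients used to generate it are invented for illustration and do not reproduce the values reported here.

```r
# Simulated toy data (assumption: NOT the wine dataset). Quality is built to
# rise with alcohol and fall with volatile acidity; pH is unrelated noise.
set.seed(7)
n <- 500
alcohol          <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
pH               <- rnorm(n, mean = 3.3,  sd = 0.15)
quality <- round(5.6 + 0.4 * (alcohol - 10.4) -
                 1.5 * (volatile.acidity - 0.53) + rnorm(n, sd = 0.7))
toy <- data.frame(alcohol, volatile.acidity, pH, quality)

# Pearson correlation of every column with quality, dropping quality itself.
r <- cor(toy, method = "pearson")[, "quality"]
r <- r[names(r) != "quality"]

# Split into the two lists, using the same 0.1 / -0.1 thresholds.
positive <- sort(r[r > 0.1], decreasing = TRUE)
negative <- sort(r[r < -0.1])

print(round(positive, 2))
print(round(negative, 2))
```

Because quality is an ordinal score, `method = "spearman"` would be a reasonable alternative to check whether the ranking of attributes changes.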

Besides quality, the attribute pairs with the strongest (positive or negative) correlations are:

 fixed.acidity:pH = -0.68
 fixed.acidity:citric.acid = 0.67
 fixed.acidity:density = 0.67
 free.sulfur.dioxide:total.sulfur.dioxide = 0.67
 volatile.acidity:citric.acid = -0.55
 citric.acid:pH = -0.54
 density:alcohol = -0.50

The more acid a wine contains, the lower its pH will be. So it makes sense that both fixed acidity and citric acid have a strong negative correlation with pH. I will look at several of the other strongly correlated pairs in a bit more detail.

“Fixed Acidity VS Citric Acid” and “Volatile Acidity VS Citric Acid”

“Fixed Acidity VS Density” and “Alcohol VS Density”

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

As for quality, wines with higher alcohol or sulphates tend to be rated better. Volatile acidity, however, is negatively correlated with quality. It is likely that fresher wines avoid the unpleasant vinegar taste of acetic acid.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Fixed acidity is positively correlated with citric acid, while volatile acidity is negatively correlated with it. Likewise, fixed acidity is positively correlated with density, while alcohol is negatively correlated with it.

What was the strongest relationship you found?

Of the variables analyzed, the strongest relationship was between fixed.acidity and pH, with a correlation coefficient of -0.68.

Multivariate Plots Section

Now let’s visualize the relationship between sulphates, volatile.acidity, alcohol and quality:

Let’s try to summarize quality using a contour plot of volatile acidity and sulphate content:

Let’s try to summarize quality using a contour plot of citric acid and alcohol content:

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Based on the multivariate analysis, four features stood out in relation to quality: alcohol, sulphates, citric acid and volatile acidity. Volatile acidity between 0.3 and 0.5 combined with sulphates between 0.6 and 0.9 was a strong indicator of good wine. Also, wines with high alcohol content and higher citric acid have a better chance of being good.
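A simple numeric check of the window described above is to flag the wines that fall inside both ranges and compare their mean quality with the rest. The sketch below does this in base R. The data frame `toy` holds a handful of made-up rows for illustration only; it is not a sample of the wine dataset.

```r
# Made-up toy rows (assumption: illustrative values, not real wines).
toy <- data.frame(
  volatile.acidity = c(0.35, 0.40, 0.45, 0.70, 0.80, 0.90, 0.38, 0.75),
  sulphates        = c(0.65, 0.80, 0.70, 0.50, 0.45, 0.55, 0.50, 0.85),
  quality          = c(7,    7,    6,    5,    4,    5,    6,    5)
)

# TRUE for wines inside both ranges mentioned in the text.
in_window <- with(toy,
  volatile.acidity >= 0.3 & volatile.acidity <= 0.5 &
  sulphates        >= 0.6 & sulphates        <= 0.9
)

# Mean quality inside versus outside the window.
mean_in  <- mean(toy$quality[in_window])
mean_out <- mean(toy$quality[!in_window])

cat("wines in window:", sum(in_window),
    "| mean quality inside:", round(mean_in, 2),
    "| outside:", round(mean_out, 2), "\n")
```

On real data, the same comparison could be extended to a grid of bins (for example with `cut()` and `tapply()`) to approximate what a contour plot shows.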

Final Plots and Summary

Plot One

When analyzing the relationship between quality and the other 11 attributes, the strongest correlation coefficient was found between alcohol and quality.

## # A tibble: 6 x 2
##   quality     n
##     <int> <int>
## 1       3    10
## 2       4    53
## 3       5   681
## 4       6   638
## 5       7   199
## 6       8    18
## wqr$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## wqr$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wqr$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wqr$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wqr$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wqr$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

Description One

The box plots for higher quality red wines are clearly shifted upward, meaning these wines have a comparatively higher alcohol content than the lower quality red wines.
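Per-quality summaries and box plots of this kind can be produced with base R's `by()`, `tapply()` and `boxplot()`. The sketch below shows the pattern on a tiny made-up data frame. The name `toy` and its values are assumptions for illustration; the report's own data frame appears in the output as `wqr`.

```r
# Made-up toy rows (assumption: illustrative values, not real wines).
toy <- data.frame(
  quality = c(3,   3,    5,   5,   5,    7,    7,    8),
  alcohol = c(9.0, 10.0, 9.4, 9.7, 10.0, 11.0, 12.0, 12.5)
)

# Six-number summary of alcohol within each quality score.
print(by(toy$alcohol, toy$quality, summary))

# Median alcohol per quality score, as a named vector.
med <- tapply(toy$alcohol, toy$quality, median)
print(med)

# One box per quality score; an upward shift means higher alcohol content.
boxplot(alcohol ~ quality, data = toy,
        xlab = "quality", ylab = "alcohol (% by volume)")
```

Comparing the medians across groups is a quick way to confirm the upward shift that the box plots show visually.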

Plot Two

Description Two

Observe that lower sulphates content is typically associated with bad wine, with alcohol varying between 9% and 12%. Average wines have higher concentrations of sulphates; among them, wines rated 6 tend to have higher alcohol content and larger sulphates content. Excellent wines are mostly clustered around higher alcohol and higher sulphates contents.

Plot Three

Description Three

This shows that higher quality red wines are generally found in the citric acid range from 0.25 to 0.65 and at higher alcohol content, above 10.5%. Lower quality red wines generally have lower alcohol, lower citric acid, or both.


Reflection

The red wine dataset contains information on 1,599 red wine instances, with 11 attributes and one output attribute. Initially, I tried to get a sense of how each attribute varies on its own. All univariate plots have been arranged together. Many of the variables look approximately normally distributed. However, chlorides, sulphates, alcohol, free sulfur dioxide and total sulfur dioxide look like they have lognormal distributions. So I excluded the values above the 95th percentile for these five features and re-plotted their histograms.

Then, I tried to find what factors might affect the quality of the wine. Here, the Pearson correlation coefficient helps to quantify the relationship between each pair of variables. Using the insights from the correlation coefficients provided by the paired plots, it was interesting to explore quality using box plots with a different color for each quality. Besides, melting the data frame and using facet grids was really helpful for visualizing the distribution of the parameters with scatter plots. Finally, a contour plot of wine quality with a point plot of volatile acidity and alcohol would be a good choice to show that either lower volatile acidity or higher alcohol is more likely to make a better wine. The result makes sense. Volatile acidity, which is the amount of acetic acid in wine, is mostly caused by bacteria. At too high levels it can lead to an unpleasant, vinegar taste.

Citation

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.